Introduction¶
The objective of this project is to apply data analysis techniques by adopting a data-driven approach for the launch of a hypothetical new mobile application.
For this analysis, two data sources are available. The first is a comprehensive dataset of applications from the Play Store that provides detailed information about each app: from ratings to reviews, from installation counts to prices, including technical specifications. The second source is a collection of user reviews, already processed using sentiment analysis techniques.
Starting from the foundations, I have established a robust framework for data management, with particular attention to cleaning and preparation.
A particularly interesting aspect of the analysis is the examination of competition across different categories. Some categories might appear attractive at first glance, perhaps due to high download numbers, but could prove to be saturated markets with intense competition. Other categories, seemingly smaller, might conceal valuable niches with less competition and, potentially, users more willing to pay for quality products.
The code has been structured following a modular approach, with particular emphasis on clarity and documentation. Each phase of the analysis is organized into well-defined logical components, allowing readers to easily follow the analytical process from raw data to final conclusions. This structure not only makes the code more robust but also facilitates the verification and validation of results obtained at each stage of the analysis.
1. Imports and setup¶
The following code cell prepares the working environment for data analysis. It handles library imports and configures the analysis environment.
The libraries used are:
- warnings: to manage and suppress warning messages that might clutter the analysis output;
- logging: to monitor the various phases of analysis and capture errors or unexpected behavior during data processing;
- typing: enables specification of expected data types for variables and functions, enhancing code robustness and self-documentation. I used it to clearly define function inputs and outputs, such as Dict for dictionaries or Optional for values that could be None;
- dataclasses: simplifies the creation of data-containing classes;
- pathlib and os: libraries that work together to manage file operations, such as verifying CSV existence and handling paths across different operating systems;
- lru_cache from functools: implements a memory cache for frequently used formatting functions, preventing recalculation of previously obtained results and improving performance;
- ThreadPoolExecutor from concurrent.futures: enables parallelization of resource-intensive data processing operations, enhancing analysis performance;
- re: for data cleaning and standardization, particularly for extracting numerical information from strings such as app prices and sizes;
- datetime: essential for temporal data analysis, especially for calculating time intervals;
- pandas: to create and manipulate dataframes;
- numpy: complements pandas by providing advanced numerical calculation capabilities, such as metrics computation, statistics, and management of multidimensional arrays necessary for app performance analysis;
- pandas.api.types: implements strict checks on dataframe column data types, ensuring numerical operations are performed only on appropriate data.
For data visualization, I employed two complementary approaches:
- plotly (with its express, graph_objects, and subplots modules) to create interactive, detailed visualizations of Play Store metrics, enabling dynamic data exploration;
- matplotlib.pyplot and seaborn to generate traditional statistical visualizations, particularly valuable for distribution and correlation analysis.
Warnings are suppressed with warnings.filterwarnings('ignore') to maintain clean output, while the logging system is configured through logging.basicConfig() to track operations and errors.
The PlotConfig class, defined using the @dataclass decorator, centralizes visualization configurations.
COLOR_PALETTE maps states like 'primary', 'success', 'warning' to their corresponding hexadecimal color codes, while PLOT_STYLE establishes a consistent style for plots with font, sizes, and basic characteristics.
The __post_init__ method automatically executes after the __init__ method and defines the default values for COLOR_PALETTE and PLOT_STYLE.
I implemented the DataFormatter class to handle data formatting. It contains three static methods, each decorated with @lru_cache(maxsize=1000) for result memorization: format_number(), format_percentage(), and format_currency(), which process numbers with thousand separators, percentages with one decimal place, and monetary values respectively.
The @staticmethod decorator indicates that these are static methods, which can be called directly on the class without instantiation.
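The formatting behavior and the effect of @lru_cache can be checked in isolation. The sketch below uses standalone functions rather than the full DataFormatter class, but applies the same format strings:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def format_number(num):
    # Thousand separators, no decimals (mirrors DataFormatter.format_number)
    return f"{num:,.0f}"

@lru_cache(maxsize=1000)
def format_percentage(num):
    # One decimal place plus a percent sign
    return f"{num:.1f}%"

@lru_cache(maxsize=1000)
def format_currency(num):
    # Dollar sign, thousand separators, two decimals
    return f"${num:,.2f}"

print(format_number(2500000))     # 2,500,000
print(format_percentage(66.666))  # 66.7%
print(format_currency(1234.5))    # $1,234.50

# A second call with the same argument is served from the cache
format_number(2500000)
print(format_number.cache_info().hits)  # 1
```

The cache is only a win if the same values recur, which is the case here since many apps share identical install counts and prices.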
The DataLoader class forms the core of data loading. The _is_colab_environment() method detects execution on Google Colab, while _setup_visualization_settings() configures pandas and visualization tool settings.
The main method load_data() handles CSV file loading through a try-except block, logging any errors via the logger.
Initialization occurs with data_loader = DataLoader(), followed by a data loading attempt. A check with if apps_df is None or reviews_df is None verifies successful operation, terminating execution with sys.exit(1) if an error occurs.
Regarding visualization libraries, plotly.express and plotly.graph_objects will create interactive graphics, while matplotlib.pyplot and seaborn will generate traditional statistical visualizations.
The manipulation and analysis of numerical data will rely on numpy and pandas.
# Warning management and logging
import warnings
import logging
from typing import Dict, List, Tuple, Optional, Any, NamedTuple, Union
from dataclasses import dataclass, field
from pathlib import Path
import sys
import os
from functools import lru_cache
from concurrent.futures import ThreadPoolExecutor
import re
from datetime import datetime

# Disable warnings
warnings.filterwarnings('ignore')

# Base logging setup
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Basic libraries for data analysis
import pandas as pd
import numpy as np
from pandas.api.types import is_numeric_dtype

# Libraries for visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns


@dataclass
class PlotConfig:
    COLOR_PALETTE: Dict[str, str] = None
    PLOT_STYLE: Dict[str, Any] = None

    def __post_init__(self):
        self.COLOR_PALETTE = {
            'primary': '#2c3e50',
            'secondary': '#34495e',
            'success': '#27ae60',
            'warning': '#f39c12',
            'danger': '#c0392b',
            'info': '#3498db'
        }
        self.PLOT_STYLE = {
            'template': 'plotly_white',
            'font_family': 'Arial, sans-serif',
            'title_font_size': 20,
            'title_x': 0.5,
            'showlegend': True
        }


class DataFormatter:
    @staticmethod
    @lru_cache(maxsize=1000)
    def format_number(num: Union[int, float]) -> str:
        """Formats numbers with thousand separators"""
        return f"{num:,.0f}"

    @staticmethod
    @lru_cache(maxsize=1000)
    def format_percentage(num: Union[int, float]) -> str:
        """Formats percentages with 1 decimal place"""
        return f"{num:.1f}%"

    @staticmethod
    @lru_cache(maxsize=1000)
    def format_currency(num: Union[int, float]) -> str:
        """Formats monetary values"""
        return f"${num:,.2f}"


class DataLoader:
    def __init__(self):
        self.plot_config = PlotConfig()

    def _is_colab_environment(self) -> bool:
        try:
            import google.colab
            return True
        except ImportError:
            return False

    def _setup_visualization_settings(self):
        # Pandas settings
        pd.set_option('display.max_columns', None)
        pd.set_option('display.max_rows', 100)
        pd.set_option('display.float_format', lambda x: '%.3f' % x)
        # Matplotlib/seaborn settings
        plt.style.use('default')
        sns.set_theme(style='whitegrid')

    def load_data(self) -> Tuple[Optional[pd.DataFrame], Optional[pd.DataFrame]]:
        try:
            # Setup visualization
            self._setup_visualization_settings()
            # Determine environment and load data
            if self._is_colab_environment():
                logger.info("Detected environment: Google Colab")
                from google.colab import files
                logger.info("Please upload the files 'googleplaystore.csv' and 'googleplaystore_user_reviews.csv'")
                uploaded = files.upload()
            # Check file presence
            required_files = ['googleplaystore.csv', 'googleplaystore_user_reviews.csv']
            for file in required_files:
                if not os.path.exists(file):
                    raise FileNotFoundError(f"File {file} not found in the current directory")
            apps_df = pd.read_csv('googleplaystore.csv')
            reviews_df = pd.read_csv('googleplaystore_user_reviews.csv')
            logger.info("Datasets loaded successfully!")
            logger.info(f"apps_df dimensions: {apps_df.shape}")
            logger.info(f"reviews_df dimensions: {reviews_df.shape}")
            return apps_df, reviews_df
        except Exception as e:
            logger.error(f"Error loading data: {str(e)}")
            return None, None


# Loader initialization and data loading
data_loader = DataLoader()
apps_df, reviews_df = data_loader.load_data()

# Verify that data has been loaded correctly
if apps_df is None or reviews_df is None:
    logger.error("Error loading datasets. Check the presence of files or reload them.")
    sys.exit(1)
Saving googleplaystore_user_reviews.csv to googleplaystore_user_reviews.csv Saving googleplaystore.csv to googleplaystore.csv
2. Data reading and validation¶
The second block of code addresses the validation and initial analysis of the two main datasets: apps_df, which contains information about Google Play Store apps, and reviews_df, which contains user reviews.
At the beginning of the cell, I import partial from functools, which I will use to parallelize validation operations.
Then I define a series of specialized classes to handle different aspects of data validation.
The DatasetMetrics class is defined as a @dataclass and serves as a container for the main metrics of a dataset. It includes fields such as:
- rows and columns for dimensions;
- missing_data to track the percentages of missing values per column;
- duplicates and duplicate_percentage for duplicate records;
- dtypes for column data types;
- unique_counts to count unique values in each column.
The DataValidator class implements the actual validation logic. Its constructor accepts max_workers to control parallelization.
The _calculate_column_metrics method is decorated with @staticmethod because it doesn't need to access the class instance state. It takes a dataframe (df) and column name (column) as input and returns a dictionary with the following metrics:
- type: the column's data type obtained with dtype and converted to string;
- non_null: the number of non-null values, calculated with count();
- null: the number of null values, obtained with isnull().sum();
- null_perc: the percentage of null values, calculated as (null / total) * 100 and rounded to 2 decimals;
- unique_values: the number of unique values in the column, obtained with nunique().
The validate_dataset method is the core of validation and uses ThreadPoolExecutor to parallelize calculations across columns. Through executor.map and partial, it applies _calculate_column_metrics to all columns simultaneously. It also calculates the number of duplicates in the dataset using duplicated().sum().
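The executor.map plus partial pattern can be seen in isolation on a toy table (the data and the column_nulls helper below are hypothetical, for illustration only):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def column_nulls(data, column):
    """Count None entries in one column of a dict-of-lists table."""
    return sum(v is None for v in data[column])

table = {'Rating': [4.1, None, 3.9], 'Price': [0.0, 1.99, None]}

# partial fixes the first argument (the table); map supplies each column name
with ThreadPoolExecutor(max_workers=2) as executor:
    nulls = dict(zip(table, executor.map(partial(column_nulls, table), table)))

print(nulls)  # {'Rating': 1, 'Price': 1}
```

This is exactly the shape of the call in validate_dataset: the dataframe is bound once with partial, and the executor distributes one column name per task.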
I then implemented the DataConsistencyChecker class to verify consistency between the two datasets. Its check_consistency method (decorated with @staticmethod) uses set operations to compare apps present in the datasets:
- creates two sets with unique() to get the unique apps in each dataset;
- uses intersection to find apps present in both;
- uses the - operator to identify apps with reviews but absent from the main dataset.
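The set logic can be checked on hypothetical app names:

```python
# Hypothetical app names illustrating the comparison in check_consistency
apps_in_store = {'ChatApp', 'PhotoEditor', 'Runner'}
apps_in_reviews = {'ChatApp', 'Runner', 'GhostApp'}

common_apps = apps_in_store.intersection(apps_in_reviews)
missing_apps = apps_in_reviews - apps_in_store  # reviewed but absent from the store data

print(sorted(common_apps))   # ['ChatApp', 'Runner']
print(sorted(missing_apps))  # ['GhostApp']
```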
During execution, various useful statistics on the distribution of apps between datasets are logged. The method keeps track of the total number of apps in each dataset, how many are in common, and generates a warning if it finds apps that have reviews but don't exist in the main dataset. The overall data integrity is also verified, recording the number of unique apps in the Play Store and the total records in the reviews. At the end of the verification, the method returns the original dataframes in a tuple, without making any changes.
The InitialAnalyzer class provides a first overview of the data. Its analyze_dataset method separates columns into numeric and categorical using select_dtypes:
- for numeric columns it uses describe() to obtain basic statistics;
- for categorical columns it counts unique values and, if there are fewer than 10, calculates the distribution with value_counts(normalize=True).
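A minimal sketch of this column split and normalized distribution, on hypothetical data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Rating': [4.5, 3.0, 4.0, 4.5],
    'Type': ['Free', 'Paid', 'Free', 'Free'],
})

# Separate numeric and categorical (object-dtype) columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(include=['object']).columns
print(list(numeric_cols), list(categorical_cols))  # ['Rating'] ['Type']

# normalize=True returns relative frequencies rather than raw counts
dist = df['Type'].value_counts(normalize=True)
print(dist['Free'])  # 0.75
```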
The main function validate_and_analyze_data effectively manages the entire process:
- initializes the necessary classes if not provided;
- performs validation of both datasets;
- verifies their consistency;
- conducts the initial analysis.
Everything is managed in a try-except block to catch and log any errors.
The use of logger throughout the code allows detailed tracking of the process and its results, facilitating the identification of any problems in the data.
from functools import partial

logger = logging.getLogger(__name__)


@dataclass
class DatasetMetrics:
    rows: int
    columns: int
    missing_data: Dict[str, float]
    duplicates: int
    duplicate_percentage: float
    dtypes: Dict[str, str]
    unique_counts: Dict[str, int]


class DataValidator:
    def __init__(self, max_workers: int = 4):
        self.max_workers = max_workers

    @staticmethod
    def _calculate_column_metrics(df: pd.DataFrame, column: str) -> Dict[str, Any]:
        return {
            'type': str(df[column].dtype),
            'non_null': df[column].count(),
            'null': df[column].isnull().sum(),
            'null_perc': (df[column].isnull().sum() / len(df) * 100).round(2),
            'unique_values': df[column].nunique()
        }

    def validate_dataset(self, df: pd.DataFrame, dataset_name: str) -> DatasetMetrics:
        logger.info(f"\nValidating dataset: {dataset_name}")
        logger.info("-" * 50)
        # Calculate base metrics
        rows, cols = df.shape
        logger.info(f"Dimensions: {rows:,} rows, {cols} columns")
        # Calculate metrics for each column in parallel
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            column_metrics = dict(zip(
                df.columns,
                executor.map(partial(self._calculate_column_metrics, df), df.columns)
            ))
        # Calculate duplicates
        duplicates = df.duplicated().sum()
        duplicate_percentage = (duplicates / len(df) * 100).round(2)
        logger.info(f"\nDuplicates found: {duplicates:,} ({duplicate_percentage}%)")
        return DatasetMetrics(
            rows=rows,
            columns=cols,
            missing_data={col: metrics['null_perc']
                          for col, metrics in column_metrics.items()},
            duplicates=duplicates,
            duplicate_percentage=duplicate_percentage,
            dtypes={col: metrics['type']
                    for col, metrics in column_metrics.items()},
            unique_counts={col: metrics['unique_values']
                           for col, metrics in column_metrics.items()}
        )


class DataConsistencyChecker:
    @staticmethod
    def check_consistency(apps_df: pd.DataFrame,
                          reviews_df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
        logger.info("\nVerifying consistency between datasets")
        logger.info("-" * 50)
        # Check apps present in both datasets
        apps_in_store = set(apps_df['App'].unique())
        apps_in_reviews = set(reviews_df['App'].unique())
        common_apps = apps_in_store.intersection(apps_in_reviews)
        missing_apps = apps_in_reviews - apps_in_store
        logger.info(f"Apps in Play Store: {len(apps_in_store):,}")
        logger.info(f"Apps with reviews: {len(apps_in_reviews):,}")
        logger.info(f"Apps in common: {len(common_apps):,}")
        if missing_apps:
            logger.warning(
                f"\nWarning: {len(missing_apps):,} apps have reviews "
                "but are not in the main dataset"
            )
        # Verify integrity
        logger.info("\nVerifying data integrity:")
        logger.info(f"- Unique apps in Play Store: {apps_df['App'].nunique():,}")
        logger.info(f"- Total review records: {len(reviews_df):,}")
        return apps_df, reviews_df


class InitialAnalyzer:
    @staticmethod
    def analyze_dataset(df: pd.DataFrame, dataset_name: str) -> None:
        logger.info(f"\nInitial analysis: {dataset_name}")
        logger.info("-" * 50)
        # Analyze numeric columns
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        if len(numeric_cols) > 0:
            logger.info("\nNumeric column statistics:")
            logger.info(df[numeric_cols].describe().round(2))
        # Analyze categorical columns
        categorical_cols = df.select_dtypes(include=['object']).columns
        if len(categorical_cols) > 0:
            logger.info("\nCategorical column statistics:")
            for col in categorical_cols:
                unique_vals = df[col].nunique()
                logger.info(f"\n{col}:")
                logger.info(f"- Unique values: {unique_vals:,}")
                if unique_vals < 10:
                    dist = df[col].value_counts(normalize=True).head()
                    logger.info(f"Distribution:\n{dist.round(3)}")


def validate_and_analyze_data(apps_df: pd.DataFrame,
                              reviews_df: pd.DataFrame,
                              validator: Optional[DataValidator] = None,
                              consistency_checker: Optional[DataConsistencyChecker] = None,
                              initial_analyzer: Optional[InitialAnalyzer] = None) -> Tuple[pd.DataFrame, pd.DataFrame]:
    logger.info("=== BEGINNING DATA VALIDATION AND ANALYSIS ===")
    # Initialize components if not provided
    validator = validator or DataValidator()
    consistency_checker = consistency_checker or DataConsistencyChecker()
    initial_analyzer = initial_analyzer or InitialAnalyzer()
    try:
        # Dataset validation
        apps_metrics = validator.validate_dataset(apps_df, "Google Play Store Apps")
        reviews_metrics = validator.validate_dataset(reviews_df, "App Reviews")
        # Consistency check
        apps_df, reviews_df = consistency_checker.check_consistency(apps_df, reviews_df)
        # Initial analysis
        initial_analyzer.analyze_dataset(apps_df, "Google Play Store Apps")
        initial_analyzer.analyze_dataset(reviews_df, "App Reviews")
        return apps_df, reviews_df
    except Exception as e:
        logger.error(f"Error during validation and analysis: {str(e)}")
        raise


# Execute validation and analysis
apps_df, reviews_df = validate_and_analyze_data(apps_df, reviews_df)
WARNING:__main__: Warning: 54 apps have reviews but are not in the main dataset
3. Data cleaning and preparation¶
The third code cell focuses on data cleaning and preparation, a crucial phase for ensuring that subsequent analysis is based on accurate and well-structured information.
The DataCleaner base class provides four static cleaning methods:
- clean_size() converts app sizes to megabytes. It handles cases like 'Varies with device' and automatically converts KB to MB by dividing by 1024 when necessary. It uses the regular expression re.sub(r'[^0-9.]', '', ...) to keep only digits and the decimal point;
- clean_price() standardizes price values in numerical format, converting strings like '$4.99' to decimal values (4.99) and handling special cases like 'Free', which are mapped to 0.0;
- clean_installs() converts installation counts to integers, removing commas and the '+' sign (e.g., '1,000,000+' becomes 1000000);
- clean_android_version() extracts and normalizes the Android version, using a regular expression to find the leading numeric part (e.g., from "Android 4.0.3 and up" it extracts 4.0).
Two specialized classes derive from this base class. The first is AppsDataCleaner, dedicated to cleaning the applications dataset. This class implements _parallel_clean_row(), which cleans a single row by applying cleaning methods and adding new clean columns, and clean_apps_dataset(), which coordinates the entire process using ThreadPoolExecutor to parallelize cleaning and improve performance.
During the app cleaning process, several advanced operations are performed:
- data type conversion using pd.to_numeric() and pd.to_datetime();
- removal of rows with missing values through dropna();
- feature engineering with the creation of the Days_Since_Update variable, which indicates the number of days elapsed since each record's last update, providing temporal information useful for the analysis;
- categorization of continuous variables like price and installations using pd.cut();
- calculation of market metrics such as Market_Share and Category_Share.
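The categorization and share calculations can be sketched on a tiny hypothetical dataframe, reusing the same price bins as the cleaning code below:

```python
import pandas as pd
import numpy as np

apps = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'TOOLS'],
    'Price_Clean': [0.0, 2.49, 4.99],
    'Installs_Clean': [1_000_000, 500_000, 500_000],
})

# Same binning scheme as Price_Category in clean_apps_dataset
apps['Price_Category'] = pd.cut(
    apps['Price_Clean'],
    bins=[-np.inf, 0, 0.99, 2.99, 4.99, np.inf],
    labels=['Free', 'Very Low', 'Low', 'Medium', 'Premium']
)
print(list(apps['Price_Category']))  # ['Free', 'Low', 'Medium']

# Share of installs within each category, as in Category_Share
apps['Category_Share'] = apps.groupby('Category')['Installs_Clean'].transform(
    lambda x: x / x.sum()
)
print(apps['Category_Share'].round(3).tolist())  # [0.667, 0.333, 1.0]
```

Note that pd.cut uses right-closed intervals, so a price of exactly 0 falls into the 'Free' bin and 4.99 into 'Medium'.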
The second derived class, ReviewsDataCleaner, is specific to the reviews dataset. Although simpler, it handles several tasks:
- cleaning missing values in sentiment columns;
- converting data types for polarity and subjectivity metrics;
- creating the Review_Length feature to analyze review length;
- categorizing sentiment polarity into "Negative", "Neutral", and "Positive".
I should note that these functionalities are not fully utilized in subsequent analyses. Initially, I had planned to develop visualizations and insights based on sentiment and user reviews, but during analysis I found that these data did not produce sufficiently significant or interpretable results for the project objectives. I therefore decided to focus on analyzing app metrics (rating, installations, price) which provided more concrete insights for identifying market opportunities. Despite reviews not being used in subsequent analyses, I have maintained this cleaning code section for methodological completeness and possible future investigations.
Finally, the main function clean_datasets() manages the entire cleaning process. It initializes the necessary cleaners, performs the cleaning of both datasets, and records detailed statistics on the transformations performed. Everything is encapsulated in a try-except block to handle any errors during the process.
The final result is two clean and enriched DataFrames, apps_clean and reviews_clean, ready for subsequent exploratory analyses.
logger = logging.getLogger(__name__)


@dataclass
class CleaningReport:
    original_rows: int
    cleaned_rows: int
    removed_rows: int
    missing_before: Dict[str, int]
    missing_after: Dict[str, int]
    cleaning_steps: List[str]


class DataCleaner:
    @staticmethod
    def clean_size(size: str) -> Optional[float]:
        """Converts the app size to MB"""
        if pd.isna(size) or size == 'Varies with device':
            return np.nan
        try:
            size_str = str(size).strip().upper()
            multiplier = 1 / 1024 if 'K' in size_str else 1
            return float(re.sub(r'[^0-9.]', '', size_str)) * multiplier
        except (ValueError, AttributeError):
            return np.nan

    @staticmethod
    def clean_price(price: str) -> float:
        """Converts the price to numeric value"""
        if pd.isna(price) or price in ['Free', '0', 'Everyone']:
            return 0.0
        try:
            return float(re.sub(r'[^0-9.]', '', str(price)))
        except (ValueError, AttributeError):
            return 0.0

    @staticmethod
    def clean_installs(installs: str) -> int:
        """Converts the number of installations to numeric value"""
        if pd.isna(installs):
            return 0
        try:
            return int(str(installs).replace(',', '').replace('+', '').strip())
        except ValueError:
            return 0

    @staticmethod
    def clean_android_version(version: str) -> Optional[float]:
        """Extracts and normalizes the Android version"""
        if pd.isna(version):
            return np.nan
        try:
            match = re.search(r'(\d+\.?\d?)', str(version))
            return round(float(match.group(1)), 1) if match else np.nan
        except (ValueError, AttributeError):
            return np.nan


class AppsDataCleaner(DataCleaner):
    def __init__(self, max_workers: int = 4):
        self.max_workers = max_workers
        self.cleaning_steps = []

    def _parallel_clean_row(self, row: pd.Series) -> pd.Series:
        """Cleans a single row of the dataset"""
        row = row.copy()
        row['Size_MB'] = self.clean_size(row['Size'])
        row['Price_Clean'] = self.clean_price(row['Price'])
        row['Installs_Clean'] = self.clean_installs(row['Installs'])
        row['Android_Ver_Clean'] = self.clean_android_version(row['Android Ver'])
        return row

    def clean_apps_dataset(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, CleaningReport]:
        """Cleans the applications dataset"""
        logger.info("Cleaning applications dataset in progress...")
        df_clean = df.copy()
        original_rows = len(df_clean)
        missing_before = df_clean.isnull().sum().to_dict()
        # Parallel row cleaning
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            cleaned_rows = list(executor.map(self._parallel_clean_row,
                                             [row for _, row in df_clean.iterrows()]))
        df_clean = pd.DataFrame(cleaned_rows, index=df_clean.index)
        # Data type conversion
        df_clean['Rating'] = pd.to_numeric(df_clean['Rating'], errors='coerce')
        df_clean['Reviews'] = pd.to_numeric(df_clean['Reviews'], errors='coerce')
        df_clean['Last Updated'] = pd.to_datetime(df_clean['Last Updated'], errors='coerce')
        # Remove rows with critical missing values
        df_clean = df_clean.dropna(subset=['Android_Ver_Clean'])
        # Feature engineering
        df_clean['Days_Since_Update'] = (pd.Timestamp.now() - df_clean['Last Updated']).dt.days
        # Categorization
        df_clean['Price_Category'] = pd.cut(
            df_clean['Price_Clean'],
            bins=[-np.inf, 0, 0.99, 2.99, 4.99, np.inf],
            labels=['Free', 'Very Low', 'Low', 'Medium', 'Premium']
        )
        df_clean['Install_Category'] = pd.cut(
            df_clean['Installs_Clean'],
            bins=[0, 1000, 100000, 1000000, 10000000, np.inf],
            labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']
        )
        # Calculate market metrics
        total_installs = df_clean['Installs_Clean'].sum()
        df_clean['Market_Share'] = df_clean['Installs_Clean'] / total_installs
        df_clean['Category_Share'] = df_clean.groupby('Category')['Installs_Clean'].transform(
            lambda x: x / x.sum()
        )
        # Cleaning report
        cleaning_report = CleaningReport(
            original_rows=original_rows,
            cleaned_rows=len(df_clean),
            removed_rows=original_rows - len(df_clean),
            missing_before=missing_before,
            missing_after=df_clean.isnull().sum().to_dict(),
            cleaning_steps=self.cleaning_steps
        )
        return df_clean, cleaning_report


class ReviewsDataCleaner(DataCleaner):
    def clean_reviews_dataset(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, CleaningReport]:
        """Cleans the reviews dataset"""
        logger.info("Cleaning reviews dataset in progress...")
        df_clean = df.copy()
        original_rows = len(df_clean)
        missing_before = df_clean.isnull().sum().to_dict()
        # Clean missing values
        df_clean = df_clean.dropna(subset=['Sentiment', 'Sentiment_Polarity'])
        # Data type conversion
        df_clean['Sentiment_Polarity'] = pd.to_numeric(df_clean['Sentiment_Polarity'], errors='coerce')
        df_clean['Sentiment_Subjectivity'] = pd.to_numeric(df_clean['Sentiment_Subjectivity'], errors='coerce')
        # Feature engineering
        df_clean['Review_Length'] = df_clean['Translated_Review'].str.len()
        # Sentiment categorization
        df_clean['Sentiment_Category'] = pd.cut(
            df_clean['Sentiment_Polarity'],
            bins=[-1, -0.33, 0.33, 1],
            labels=['Negative', 'Neutral', 'Positive']
        )
        # Cleaning report
        cleaning_report = CleaningReport(
            original_rows=original_rows,
            cleaned_rows=len(df_clean),
            removed_rows=original_rows - len(df_clean),
            missing_before=missing_before,
            missing_after=df_clean.isnull().sum().to_dict(),
            cleaning_steps=[]
        )
        return df_clean, cleaning_report


def clean_datasets(apps_df: pd.DataFrame,
                   reviews_df: pd.DataFrame,
                   apps_cleaner: Optional[AppsDataCleaner] = None,
                   reviews_cleaner: Optional[ReviewsDataCleaner] = None) -> Tuple[pd.DataFrame, pd.DataFrame]:
    logger.info("=== BEGINNING CLEANING PROCESS ===")
    # Initialize cleaners if not provided
    apps_cleaner = apps_cleaner or AppsDataCleaner()
    reviews_cleaner = reviews_cleaner or ReviewsDataCleaner()
    try:
        # Clean app dataset
        apps_clean, apps_report = apps_cleaner.clean_apps_dataset(apps_df)
        logger.info("\nApps dataset cleaning completed:")
        logger.info(f"Original rows: {apps_report.original_rows:,}")
        logger.info(f"Rows after cleaning: {apps_report.cleaned_rows:,}")
        logger.info(f"Removed rows: {apps_report.removed_rows:,}")
        # Clean reviews dataset
        reviews_clean, reviews_report = reviews_cleaner.clean_reviews_dataset(reviews_df)
        logger.info("\nReviews dataset cleaning completed:")
        logger.info(f"Original rows: {reviews_report.original_rows:,}")
        logger.info(f"Rows after cleaning: {reviews_report.cleaned_rows:,}")
        logger.info(f"Removed rows: {reviews_report.removed_rows:,}")
        return apps_clean, reviews_clean
    except Exception as e:
        logger.error(f"Error during data cleaning: {str(e)}")
        raise


# Execute cleaning
apps_clean, reviews_clean = clean_datasets(apps_df, reviews_df)
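As a spot-check, the static cleaners behave as follows on representative raw values. The functions below are standalone mirrors of two of the DataCleaner methods, kept self-contained for illustration:

```python
import re
import numpy as np
import pandas as pd

def clean_size(size):
    """Mirror of DataCleaner.clean_size: app size in MB."""
    if pd.isna(size) or size == 'Varies with device':
        return np.nan
    size_str = str(size).strip().upper()
    multiplier = 1 / 1024 if 'K' in size_str else 1
    return float(re.sub(r'[^0-9.]', '', size_str)) * multiplier

def clean_installs(installs):
    """Mirror of DataCleaner.clean_installs: '+' and commas stripped."""
    return int(str(installs).replace(',', '').replace('+', '').strip())

print(clean_size('19M'))             # 19.0
print(clean_size('512k'))            # 0.5  (KB converted to MB)
print(clean_installs('1,000,000+'))  # 1000000
```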
4. Exploratory data analysis¶
This code cell focuses on the exploratory analysis of cleaned app data to obtain useful market information, and defines classes and functions to perform statistical analysis and generate visualizations.
The first part of the code defines two classes using the @dataclass decorator:
- CategoryStats: a class used to store statistical information about app categories. It includes fields such as the number of apps in a category, the average rating, the percentage of paid apps, the average number of installations, and the average app size;
- MarketAnalysis: a class used to contain the results of the market analysis. It includes a pandas dataframe with statistics by category, a dictionary of price-related statistics, a pandas dataframe with the market competitiveness analysis, and a list storing the plotly figures generated by the analysis.
The main class of this section is MarketAnalyzer, which takes the cleaned apps_df dataframe as input during initialization. The class includes methods for calculating statistics, creating visualizations, and performing different types of market analysis.
The __init__ method initializes MarketAnalyzer by filtering out rows where Category is '1.9' (an anomalous value in the dataset) and sets, via the max_workers argument, the number of threads used for parallel calculations.
The _calculate_category_stats method is a helper method, decorated with @lru_cache(maxsize=None), which calculates and returns CategoryStats for a given category. The @lru_cache decorator stores the results of this method for efficiency, avoiding redundant calculations when it is called multiple times with the same category.
The analyze_category_distribution method analyzes the distribution of apps across different categories. It uses a ThreadPoolExecutor to parallelize the calculation of statistics by category, creates a pandas dataframe with the calculated statistics, and generates a bar chart using plotly.express to visualize the distribution of apps across categories, using the average rating as a color shade.
The analyze_price_distribution method analyzes the price distribution of paid apps. It filters apps_df to include only paid apps, defines price ranges and labels to group prices into categories, and calculates price statistics such as the total number of apps, the number and percentage of paid apps, and the average, median, and maximum price of paid apps. It then aggregates the data by price range and generates a bar chart with plotly.graph_objects, showing the percentage of apps in each price range on the y-axis and the price ranges on the x-axis, with bars colored by the average rating within each range.
The analyze_market_competition method aims to visually represent the competitiveness of the app market through a map that relates various key metrics.
To build this map, the code begins by grouping the apps_df dataframe by category ("Category") and calculating four fundamental metrics for each:
- the number of apps, which directly indicates the level of competition;
- the average rating, which reflects the quality perceived by users;
- the percentage of paid apps, which represents the propensity to pay in that category;
- the average number of installations, which offers a measure of market breadth.
The payment propensity (paid_perc) is calculated with the aggregation 'Price_Clean': lambda x: (x > 0).mean() * 100. This expression transforms app prices into boolean values (True for paid apps, False for free apps), takes their mean (obtaining the proportion of paid apps), and multiplies by 100 to express the result as a percentage.
Specifically:
- (x > 0) creates an array of boolean values where True represents apps with a price greater than zero (paid apps) and False those that are free;
- .mean() calculates the mean of these boolean values, which is equivalent to the proportion of paid apps (a value between 0 and 1);
- * 100 converts this proportion into a percentage.
This metric is important because it provides an indication of users' willingness to pay for apps in that category. A higher percentage suggests that:
- users in that category are more willing to pay for content and features;
- there is a precedent for direct monetization that new entrants could exploit;
- the "premium" business model (direct payment) might be more easily accepted compared to freemium or advertising-based models.
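A minimal check of this calculation, on hypothetical prices for one category:

```python
import pandas as pd

# Hypothetical prices: two paid apps out of five
prices = pd.Series([0.0, 0.0, 0.0, 1.99, 4.99])

# Booleans -> mean -> percentage, as in the aggregation above
paid_perc = (prices > 0).mean() * 100
print(paid_perc)  # 40.0
```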
Subsequently, a "market size index" (market_size_index) is calculated based on the number of apps in each category, normalized between 0 and 1. This index represents the relative size of each category compared to others, where a category with more apps will have a higher index, indicating a larger and potentially more competitive market.
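The min-max normalization behind the index can be sketched as follows (the per-category app counts here are illustrative, not the actual dataset values):

```python
import pandas as pd

# Hypothetical app counts per category
num_apps = pd.Series({'FAMILY': 1972, 'GAME': 1144, 'WEATHER': 57})

# Min-max normalization: the largest category maps to 1, the smallest to 0
market_size_index = (num_apps - num_apps.min()) / (num_apps.max() - num_apps.min())
print(market_size_index.round(3))
```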
The actual visualization is created with a scatter plot using plotly.graph_objects. Each point on the graph represents a category of apps, positioned according to two main dimensions:
- the x-axis represents the number of apps in the category, or the level of competition. The further right a point is, the greater the number of apps in that category and therefore the higher the competition;
- the y-axis represents the average rating of apps in the category. The higher a point is, the greater the perceived quality of apps in that category.
The size of each point is proportional to the market size index calculated previously. Therefore, larger points indicate categories with a potentially larger market. The color of each point represents the average rating of apps in that category, using a color scale from red (low rating) to green (high rating), providing a visual redundancy that reinforces the information on the y-axis and allows quick identification of categories with apps perceived as superior quality.
Finally, the method calculates an opportunity_score for each category, based on a weighted combination of three factors:
- 40% from the average rating, rewarding categories with higher quality apps;
- 30% from the inverse of the market size index, favoring less competitive categories;
- 30% from the percentage of paid apps, valuing categories where there is a culture of purchasing.
The idea behind this score is that categories with high ratings, low competition, and a high percentage of paid apps potentially represent the best market opportunities, balancing quality, ease of entry, and potential for direct monetization. Note that the three factors are on different scales (ratings roughly 0-5, the inverted size index 0-1, the paid percentage 0-100), so in practice the payment propensity weighs more heavily on the final score than its nominal 30%. The visualization has been further enriched with an annotation showing the top 5 categories based on this opportunity score.
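A worked example of the weighted combination, using illustrative category metrics (not values taken from the dataset):

```python
# Hypothetical category metrics (illustrative only)
avg_rating = 4.2          # on a 0-5 scale
market_size_index = 0.2   # normalized 0-1
paid_perc = 22.6          # on a 0-100 scale

opportunity_score = (avg_rating * 0.4
                     + (1 - market_size_index) * 0.3   # reward low competition
                     + paid_perc * 0.3)                # reward willingness to pay

print(round(opportunity_score, 2))  # 8.7
```

Because paid_perc is on a 0-100 scale, its term (6.78 here) dominates the rating term (1.68) and the competition term (0.24), which is consistent with the rankings discussed in the interpretation.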
The perform_exploratory_analysis function manages the entire process of exploratory data analysis. It initializes MarketAnalyzer if it is not provided in the function arguments, calls the analysis methods of MarketAnalyzer to obtain results and related visualizations, calculates the top 5 market opportunities based on the opportunity score, and returns a MarketAnalysis object containing the results and all generated charts.
Finally, the code performs the exploratory analysis by calling perform_exploratory_analysis - using the cleaned dataframes apps_clean and reviews_clean - and displays the generated figures by iterating through the figures in the market_analysis object and using the show() method to display each plotly chart in the output.
Interpretation of results¶
Distribution of apps by category and average rating¶
The first chart provides an overview of the distribution of apps in the Google Play Store. What immediately stands out is the predominance of the "FAMILY" category, which hosts almost 2000 applications, representing the largest market segment. In second place, we find "GAME" with more than 1000 apps, which reflects the significant popularity of mobile gaming and its ability to generate substantial revenue. Further behind, we find "TOOLS" with about 750 apps, a category that encompasses various utilities and tools.
The visualization reveals an extremely heterogeneous app ecosystem, where a few categories gather most of the applications, while many others represent niches with a much more limited numerical presence. Categories like "EVENTS", "BEAUTY", "PARENTING", and "WEATHER" have a very limited presence, suggesting potential opportunities in less saturated markets.
Particularly interesting is the relationship between the number of apps and the average rating, highlighted by the coloring system. Categories with fewer applications tend to have higher average ratings (displayed in green), as in the case of "EVENTS", "EDUCATION", and "BOOKS_AND_REFERENCE". This phenomenon could indicate that in less crowded markets, it's easier to emerge with quality products that more readily meet user expectations. Conversely, highly competitive categories like "DATING" and "CASINO" show lower average ratings (in orange-red), suggesting greater difficulty in standing out and fully meeting user expectations in typically saturated markets.
Price distribution of paid apps¶
The second chart explores direct monetization strategies through the price distribution of paid apps. The market shows a clear preference for pricing in the $1-2.99 range, which encompasses 36.4% of paid apps. This data suggests a balance point between accessibility for users and perceived value by developers.
It's interesting to note how the distribution doesn't follow a linear decreasing trend: the $0-1 range (20.2%) is more populated than the $3-4.99 range (19.9%), while there's a sharp decline in the $5-9.99 range (11.3%) before a slight rise in the premium category over $10 (12.1%). This pattern reflects different monetization and positioning strategies: many developers opt for very low prices aiming for volume, while others choose premium positioning with high prices targeting users willing to pay for exclusive or very specific features.
The coloring of the bars in the chart offers an additional analytical dimension. Examining the color gradients, it emerges that applications in the lower price ranges present higher average ratings, highlighting an effective correspondence between the price point and user satisfaction. This phenomenon suggests that consumer expectations at these price levels are generally met by the user experience. Conversely, applications positioned in the premium range (over $10) show lower average ratings, which could indicate that users are more critical when paying premium prices and their expectations are harder to meet.
Competitive market map¶
The competitive map represents a sophisticated analytical tool that offers a multidimensional view of the app market. In this scatter plot, each category is positioned based on two fundamental metrics: the number of applications (X-axis) which indicates the level of competition, and the average rating (Y-axis) which reflects user satisfaction.
The competitive landscape appears stratified, with "FAMILY" dominating in quantitative terms with about 2000 apps (extreme right of the chart), followed by "GAME" with about 1000 apps and "TOOLS" with about 750 apps, as was already evident from the chart "Distribution of Apps by Category and Average Rating". The size of the circles, proportional to the market size index, visually amplifies this hierarchy, highlighting the significant weight of these categories in the overall ecosystem.
Particularly interesting is the vertical distribution: categories like "EVENTS", "EDUCATION", and "ART_AND_DESIGN" are located in the upper part of the chart with average ratings above 4.3, suggesting a high perceived quality. At the opposite extreme, categories like "DATING" show more modest ratings, indicating greater difficulty in meeting user expectations.
The analysis is significantly enriched by the information box in the bottom right of the chart, which reveals the top 5 market opportunities according to the opportunity score:
- MEDICAL emerges as the most promising opportunity with a score of 8.67, combining a good rating (4.2), moderate competition (439 apps), and a high propensity to pay (22.6% of paid apps);
- PERSONALIZATION positions itself immediately after with 8.63, thanks to a slightly higher rating (4.3) and a less crowded market (352 apps), maintaining a high percentage of paid apps (22.2%);
- BOOKS_AND_REFERENCE presents a score of 6.06, with an excellent rating (4.3) and low competition (200 apps), although it has a lower propensity to pay (13.5%);
- WEATHER represents an interesting niche with a score of 5.15, combining a good rating (4.2) with minimal competition (only 57 apps) and a decent propensity to pay (10.5%);
- TOOLS, despite high competition (744 apps), maintains a respectable score of 4.57 thanks to a decent rating (4.0) and the presence of a good segment of users willing to pay (9.3%).
This ranking reveals how the best opportunities are not necessarily the categories located in the upper left corner of the chart (high rating, low competition). The propensity to pay plays an important role, making categories like MEDICAL and PERSONALIZATION particularly attractive despite not being the least competitive or those with the highest ratings in absolute terms.
logger = logging.getLogger(__name__)

@dataclass
class CategoryStats:
    num_apps: int
    avg_rating: float
    paid_perc: float
    avg_installs: float
    avg_size: float

@dataclass
class MarketAnalysis:
    category_stats: pd.DataFrame
    price_stats: Dict[str, float]
    market_analysis: pd.DataFrame
    figures: List[go.Figure] = field(default_factory=list)
class MarketAnalyzer:
    def __init__(self, apps_df: pd.DataFrame, max_workers: int = 4):
        self.apps_df = apps_df[apps_df['Category'] != '1.9'].copy()
        self.max_workers = max_workers

    @lru_cache(maxsize=None)
    def _calculate_category_stats(self, category: str) -> CategoryStats:
        cat_data = self.apps_df[self.apps_df['Category'] == category]
        return CategoryStats(
            num_apps=len(cat_data),
            avg_rating=cat_data['Rating'].mean(),
            paid_perc=(cat_data['Price_Clean'] > 0).mean() * 100,
            avg_installs=cat_data['Installs_Clean'].mean(),
            avg_size=cat_data['Size_MB'].mean()
        )
    def analyze_category_distribution(self) -> Tuple[pd.DataFrame, go.Figure]:
        # Parallel calculation of statistics by category
        categories = self.apps_df['Category'].unique()
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            stats = list(executor.map(self._calculate_category_stats, categories))
        # DataFrame creation
        category_stats = pd.DataFrame({
            'Category': categories,
            'num_apps': [s.num_apps for s in stats],
            'avg_rating': [s.avg_rating for s in stats],
            'paid_perc': [s.paid_perc for s in stats],
            'avg_installs': [s.avg_installs for s in stats],
            'avg_size': [s.avg_size for s in stats]
        }).round(2)
        # Optimized chart creation
        fig = px.bar(
            category_stats,
            x='Category',
            y='num_apps',
            color='avg_rating',
            title='Distribution of apps by category and average rating',
            labels={
                'num_apps': 'Number of apps',
                'Category': 'Category',
                'avg_rating': 'Average rating'
            },
            color_continuous_scale=[[0, '#B30000'], [0.4, '#FF0000'],
                                    [0.6, '#FFA500'], [0.75, '#2ECC40'],
                                    [1, '#00B300']],
            range_color=[3.2, 4.8]
        )
        fig.update_layout(
            xaxis_tickangle=-45,
            showlegend=True,
            height=600,
            title_x=0.5,
            font=dict(family="Arial", size=12),
            margin=dict(t=100, l=50, r=50, b=100)
        )
        return category_stats.set_index('Category'), fig
    def analyze_price_distribution(self) -> Tuple[Dict[str, float], go.Figure]:
        df_paid = self.apps_df[self.apps_df['Price_Clean'] > 0].copy()
        price_ranges = [0, 1, 2.99, 4.99, 9.99, float('inf')]
        price_labels = ['0-1$', '1-2.99$', '3-4.99$', '5-9.99$', '10$+']
        df_paid['price_range'] = pd.cut(df_paid['Price_Clean'],
                                        bins=price_ranges,
                                        labels=price_labels)
        # Calculate price statistics
        price_stats = {
            'total_apps': len(self.apps_df),
            'paid_apps': len(df_paid),
            'paid_percentage': (len(df_paid) / len(self.apps_df)) * 100,
            'avg_price': df_paid['Price_Clean'].mean(),
            'median_price': df_paid['Price_Clean'].median(),
            'max_price': df_paid['Price_Clean'].max()
        }
        # Data aggregation for the chart
        price_distribution = df_paid.groupby('price_range').agg({
            'App': 'count',
            'Rating': 'mean'
        }).reset_index()
        price_distribution['percentage'] = (
            price_distribution['App'] / len(df_paid)
        ) * 100
        # Chart creation
        fig = go.Figure(data=[
            go.Bar(
                x=price_distribution['price_range'],
                y=price_distribution['percentage'],
                marker=dict(
                    color=price_distribution['Rating'],
                    colorscale=[[0, '#B30000'], [0.4, '#FF0000'],
                                [0.6, '#FFA500'], [0.75, '#2ECC40'],
                                [1, '#00B300']],
                    colorbar=dict(
                        title="Average rating",
                        titleside="right",
                        xpad=30,
                        len=0.9,
                        thickness=20
                    ),
                    cmin=3.2,
                    cmax=4.8
                ),
                text=price_distribution['percentage'].round(1).astype(str) + '%',
                textposition='outside'
            )
        ])
        fig.update_layout(
            title='Price distribution of paid apps',
            title_x=0.5,
            xaxis_title='Price range',
            yaxis_title='Percentage of apps (%)',
            height=500,
            yaxis_range=[0, max(price_distribution['percentage']) * 1.1],
            bargap=0.2,
            font=dict(family="Arial", size=12)
        )
        return price_stats, fig
    def analyze_market_competition(self) -> Tuple[pd.DataFrame, go.Figure]:
        """Analyzes market competitiveness"""
        market_analysis = self.apps_df.groupby('Category').agg({
            'App': 'count',
            'Rating': 'mean',
            'Price_Clean': lambda x: (x > 0).mean() * 100,
            'Installs_Clean': 'mean'
        }).round(2)
        market_analysis.columns = ['num_apps', 'avg_rating', 'paid_perc', 'avg_installs']
        # Calculate market size index
        market_analysis['market_size_index'] = (
            (market_analysis['num_apps'] - market_analysis['num_apps'].min()) /
            (market_analysis['num_apps'].max() - market_analysis['num_apps'].min())
        )
        # Chart creation
        fig = go.Figure(data=[
            go.Scatter(
                x=market_analysis['num_apps'],
                y=market_analysis['avg_rating'],
                mode='markers+text',
                text=market_analysis.index,
                textposition='top center',
                marker=dict(
                    size=market_analysis['market_size_index'] * 50,
                    color=market_analysis['avg_rating'],
                    colorscale=[[0, '#B30000'], [0.4, '#FF0000'],
                                [0.6, '#FFA500'], [0.75, '#2ECC40'],
                                [1, '#00B300']],
                    colorbar=dict(
                        title="Average rating",
                        titleside="right",
                        xpad=30,
                        len=0.9,
                        thickness=20
                    ),
                    cmin=3.2,
                    cmax=4.8
                )
            )
        ])
        fig.update_layout(
            title='Competitive market map',
            title_x=0.5,
            xaxis_title='Number of apps (competition)',
            yaxis_title='Average rating',
            height=600,
            showlegend=False,
            font=dict(family="Arial", size=12)
        )
        # Calculate opportunity score
        market_analysis['opportunity_score'] = (
            market_analysis['avg_rating'] * 0.4 +
            (1 - market_analysis['market_size_index']) * 0.3 +
            market_analysis['paid_perc'] * 0.3
        )
        # Add annotation with best opportunities
        top_opportunities = market_analysis.nlargest(5, 'opportunity_score')
        top_text = "<b>Top 5 market opportunities:</b><br>"
        for cat in top_opportunities.index:
            score = market_analysis.loc[cat, 'opportunity_score']
            rating = market_analysis.loc[cat, 'avg_rating']
            apps = market_analysis.loc[cat, 'num_apps']
            paid = market_analysis.loc[cat, 'paid_perc']
            top_text += f"<b>{cat}</b>: score {score:.2f} (rating {rating:.1f}, apps {apps}, propensity {paid:.1f}%)<br>"
        fig.add_annotation(
            x=0.99,
            y=0.01,
            xref="paper",
            yref="paper",
            xanchor="right",
            yanchor="bottom",
            text=top_text,
            showarrow=False,
            font=dict(size=11),
            align="left",
            bgcolor="rgba(255, 255, 255, 0.9)",
            bordercolor="black",
            borderwidth=1,
            borderpad=6
        )
        return market_analysis, fig
def perform_exploratory_analysis(apps_df: pd.DataFrame,
                                 reviews_df: Optional[pd.DataFrame] = None,
                                 analyzer: Optional[MarketAnalyzer] = None) -> MarketAnalysis:
    logger.info("=== BEGINNING EXPLORATORY ANALYSIS ===")
    analyzer = analyzer or MarketAnalyzer(apps_df)
    figures = []
    try:
        # 1. Category distribution analysis
        logger.info("\n1. Category distribution analysis")
        category_stats, category_fig = analyzer.analyze_category_distribution()
        figures.append(category_fig)
        # 2. Price analysis
        logger.info("\n2. Price distribution analysis")
        price_stats, price_fig = analyzer.analyze_price_distribution()
        figures.append(price_fig)
        # 3. Competitive analysis
        logger.info("\n3. Competitive market analysis")
        market_analysis, market_fig = analyzer.analyze_market_competition()
        figures.append(market_fig)
        # Log main results
        logger.info("\nTop 5 market opportunities:")
        top_opportunities = market_analysis.nlargest(5, 'opportunity_score')
        for cat in top_opportunities.index:
            logger.info(f"\n{cat}:")
            logger.info(f"- Score: {market_analysis.loc[cat, 'opportunity_score']:.2f}")
            logger.info(f"- Average rating: {market_analysis.loc[cat, 'avg_rating']:.2f}")
            logger.info(f"- Competition: {market_analysis.loc[cat, 'num_apps']:,} apps")
            logger.info(f"- % paid apps: {market_analysis.loc[cat, 'paid_perc']:.1f}%")
        return MarketAnalysis(
            category_stats=category_stats,
            price_stats=price_stats,
            market_analysis=market_analysis,
            figures=figures
        )
    except Exception as e:
        logger.error(f"Error during exploratory analysis: {str(e)}")
        raise

# Execute exploratory analysis
market_analysis = perform_exploratory_analysis(apps_clean, reviews_clean)
# Display charts
for fig in market_analysis.figures:
    fig.show()
5. Performance analysis and key metrics¶
The fifth cell transitions from an initial descriptive exploration to a more in-depth investigation of relationships between variables and temporal trends. While previous blocks identified market opportunities based on categories, we now explore how different metrics influence each other and how the market has evolved over time.
Two NamedTuple classes are defined as typed containers for results: CorrelationResults collects the Pearson and Spearman correlation matrices along with their visualizations, while TimeMetrics contains aggregated temporal data and the corresponding chart.
The main class PlayStoreAnalyzer is implemented as @dataclass and in its __post_init__ method performs a necessary filtering operation:
self.apps_df = self.apps_df[self.apps_df['Category'] != '1.9'].copy()
This line removes from the dataset a clearly anomalous observation where the "Category" column contains the value '1.9', which does not correspond to any legitimate Google Play Store category. Upon closer examination of the dataset, this record shows data misalignment, with values shifted between columns. This row contains impossible values such as a rating of "19", the string "Free" in the installations column, and other clearly misplaced elements. I decided to completely remove this data that would compromise the reliability of subsequent analyses.
After this initial cleanup, the prepare_data() method transforms raw data into meaningful analytical metrics through several essential operations:
- processes temporal information by converting the "Last Updated" column to datetime format, calculating the number of days since the last update relative to the most recent date in the dataset, and extracting the update year into a new column;
- applies logarithmic transformations (np.log1p()) to the installations and size columns. This technique is particularly useful for handling asymmetric distributions or those with wide ranges of values, allowing better visualization and analysis of relationships that would otherwise be difficult to interpret.
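A quick illustration of why log1p helps here, on hypothetical install counts spanning several orders of magnitude:

```python
import numpy as np

# Install counts spanning six orders of magnitude (hypothetical values)
installs = np.array([100, 10_000, 1_000_000, 100_000_000])

# log1p computes log(1 + x), so it is also safe for apps with zero installs
log_installs = np.log1p(installs)
print(np.round(log_installs, 2))
```

The transformed values sit in a narrow, comparable range, so scatter plots and correlations are no longer dominated by a handful of blockbuster apps.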
The code also calculates market metrics such as global market share (dividing each app's installations by total installations) and share within the category, using pandas' transform function which maintains the original dimensionality of the dataframe.
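The transform-based share calculation can be sketched on a minimal frame (column names mirror the document's; the install figures are made up):

```python
import pandas as pd

# Minimal frame of installs per app (hypothetical values)
df = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Category': ['GAME', 'GAME', 'TOOLS', 'TOOLS'],
    'Installs_Clean': [300, 100, 50, 150],
})

# Global share: each app's installs over the grand total
df['market_share'] = df['Installs_Clean'] / df['Installs_Clean'].sum()

# Category share: transform('sum') returns one value per row (same length
# as df), unlike a plain groupby aggregation which would collapse the frame
df['category_share'] = (df['Installs_Clean'] /
                        df.groupby('Category')['Installs_Clean'].transform('sum'))
print(df)
```

App 'a' gets a global share of 300/600 = 0.5 and a within-GAME share of 300/400 = 0.75.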
Additionally, the method provides for integrating sentiment data into the main dataset through _merge_sentiment_data(). This process occurs in three steps: first it joins the review data with app categories via an inner join, then aggregates the data to calculate the average polarity (positivity/negativity) and average subjectivity of reviews for each app, and finally joins this aggregated data to the main dataframe with a left join.
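The three-step merge can be reproduced on toy frames (the sentiment values are invented; column names follow the document):

```python
import pandas as pd

# Hypothetical minimal frames mirroring the described merge
apps = pd.DataFrame({'App': ['a', 'b', 'c'], 'Category': ['GAME', 'TOOLS', 'GAME']})
reviews = pd.DataFrame({
    'App': ['a', 'a', 'b'],
    'Sentiment_Polarity': [0.5, 0.1, -0.2],
    'Sentiment_Subjectivity': [0.6, 0.4, 0.9],
})

# Step 1: attach categories to reviews (inner join drops unmatched reviews)
merged = reviews.merge(apps[['App', 'Category']], on='App', how='inner')
# Step 2: aggregate average polarity/subjectivity per app
agg = merged.groupby('App', as_index=False)[
    ['Sentiment_Polarity', 'Sentiment_Subjectivity']].mean()
# Step 3: left join back, keeping apps without reviews (NaN sentiment)
apps = apps.merge(agg, on='App', how='left')
print(apps)
```

App 'c' has no reviews, so its sentiment columns end up as NaN after the left join, exactly the behavior the text describes.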
To clarify once more: the sentiment metrics (Sentiment_Polarity and Sentiment_Subjectivity) are not actually used in the visualizations and correlation analyses, for the reasons explained previously.
The analyze_correlations() method calculates two types of correlation coefficients that provide complementary perspectives:
Pearson correlation (r = Σ[(x_i - x̄)(y_i - ȳ)] / √[Σ(x_i - x̄)² · Σ(y_i - ȳ)²] - where x_i and y_i are individual observations and x̄ and ȳ are the variable means) measures linear relationships between variables, quantifying how much two variables tend to increase or decrease together proportionally. It is particularly effective for linear relationships, but can be misleading with non-linear relationships or significant outliers;
Spearman correlation (ρ = 1 - (6 · Σd_i²) / [n(n² - 1)] - where d_i is the difference between the ranks of corresponding observations and n is the number of observations) captures monotonic relationships - that is, when one variable increases, the other tends to always change in the same direction - even when they are not strictly linear. Based on the ranks of variables rather than their absolute values, it is more robust against outliers and non-normal distributions, providing a more complete view when analyzing data that often present asymmetric distributions.
The calculation of correlations is performed using pandas' built-in functionalities:
data = self.apps_df[metrics.keys()]
pearson_corr = data.corr(method='pearson').round(3)
spearman_corr = data.corr(method='spearman').round(3)
The round(3) method rounds the correlation coefficients to three decimal places to improve readability.
Using both metrics offers a potentially more complete view of the relationships between variables, allowing identification of both linear and non-linear patterns in the data.
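The complementarity of the two coefficients is easy to demonstrate on synthetic data with a monotonic but non-linear relationship (a toy example, not dataset values):

```python
import numpy as np
import pandas as pd

# Monotonic but non-linear relationship: y = x**3
x = np.arange(1, 11)
df = pd.DataFrame({'x': x, 'y': x.astype(float) ** 3})

# Same pattern as in the analysis code
pearson = df.corr(method='pearson').loc['x', 'y']
spearman = df.corr(method='spearman').loc['x', 'y']

# Spearman sees a perfect monotonic relationship; Pearson does not
print(round(pearson, 3), round(spearman, 3))
```

Spearman returns exactly 1.0 because the ranks of y follow the ranks of x perfectly, while Pearson stays below 1 because the relationship is not linear.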
To visualize these correlations, the code creates two main representations. The first is a heatmap that compares the Pearson and Spearman correlation matrices side by side, using a color scale from red (negative correlations) to blue (positive correlations) to visually highlight the strength and direction of relationships. The second is a scatter plot matrix that shows the relationships between specific pairs of variables. In the scatter plot analyzing the relationship between rating and days since last update, the code adds slight random noise to the rating values ("jitter") to avoid point overlaps, then calculates a moving average to highlight the general trend, and limits the y-axis to the 95th percentile to avoid extreme values compressing the visualization. In this second visualization, the point colors are based on the apps' ratings, with a scale ranging from red (low ratings) to green (high ratings), allowing easy identification of patterns in the relationships between variables.
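The three plotting tricks mentioned (jitter, moving average, percentile capping) can be sketched independently of plotly; all values below are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Discrete ratings overlap heavily in a scatter plot (hypothetical values)
ratings = np.array([4.0, 4.0, 4.5, 4.5, 4.5, 3.5])
# Small random noise ("jitter") separates otherwise coincident points
jittered = ratings + rng.normal(0, 0.05, size=ratings.size)

days = np.array([10, 400, 30, 900, 2500, 60])
# Cap the y-axis at the 95th percentile so extremes don't compress the plot
y_max = np.percentile(days, 95)

# Moving average to highlight the general trend
trend = pd.Series(days).rolling(window=3, min_periods=1).mean()
```

The jitter standard deviation (0.05) is an illustrative choice: large enough to separate points, small enough not to distort the rating scale.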
The analyze_temporal_trends() method offers an evolutionary perspective of the app market, aggregating data by update year to understand how key metrics have changed over time.
The analysis begins by calculating the average price of paid apps for each year, excluding free apps to avoid distorting the results. The data is then aggregated by year, calculating the average rating, average size, average installations, and app count for each period.
The visualization of temporal evolution adopts a two-subplot approach that organizes metrics into conceptually coherent groups:
the upper subplot presents product metrics, intrinsic characteristics of apps: average rating (reflecting user-perceived quality), average price (indicating monetization strategies), and average size (which may correlate with feature complexity and richness). These three parameters are visualized with different colored lines for distinction: green for rating, red for price, and blue for size;
the lower subplot visualizes market metrics, indicators of app performance and diffusion: average installations (represented by an orange line showing the average popularity of apps) and total number of apps (displayed as semi-transparent gray bars, providing context on market size evolution).
For the lower subplot, a logarithmic scale is implemented on the Y-axis, transforming exponential increments into linear increments: for example, the transitions from 1000 to 10000, from 10000 to 100000, and from 100000 to 1000000 appear as equal distances in the chart. This approach allows simultaneous visualization of apps with very different popularities and appreciation of proportional rather than absolute changes, revealing relative growth patterns that would otherwise remain hidden in a traditional linear scale.
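The "equal distances per decade" property of a log axis can be checked directly, since a log axis effectively plots log10 of the values:

```python
import numpy as np

values = np.array([1_000, 10_000, 100_000, 1_000_000])
positions = np.log10(values)   # what a log axis actually plots
print(np.diff(positions))      # one equal step (~1.0) per tenfold increase
```

Each tenfold jump occupies the same distance on the axis, which is why proportional changes become directly comparable.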
Finally, the _format_trend_value() method customizes the format based on the metric type: it transforms large installation numbers into more readable formats (K for thousands, M for millions), formats prices with the dollar symbol, adds "MB" to sizes, and appropriately handles missing values with "N/A" notation.
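A hypothetical re-creation of those formatting rules (the function body below is a sketch inferred from the description, not the author's actual implementation):

```python
def format_trend_value(value, metric):
    """Sketch of the formatting rules described above (hypothetical)."""
    if value is None or value != value:  # None or NaN
        return "N/A"
    if metric == 'installs':
        if value >= 1_000_000:
            return f"{value / 1_000_000:.1f}M"   # millions
        if value >= 1_000:
            return f"{value / 1_000:.1f}K"       # thousands
        return f"{value:.0f}"
    if metric == 'price':
        return f"${value:.2f}"
    if metric == 'size':
        return f"{value:.1f} MB"
    return str(value)

print(format_trend_value(2_500_000, 'installs'))  # 2.5M
print(format_trend_value(float('nan'), 'price'))  # N/A
```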
The analyze_play_store() function coordinates the analysis process. Its implementation follows several phases:
- the function creates a PlayStoreAnalyzer instance, passing the app and review dataframes. This instance contains all the specialized logic for the different analyses. It also prepares two empty data structures: a figures list to collect the generated visualizations and a results dictionary to store the numerical results;
- it performs the correlation analysis by calling the analyze_correlations() method within a try-except block for error handling. The results are stored in the results dictionary under the key 'correlations' and the generated figures are added to the figures list. An important feature is the identification and recording of statistically relevant correlations: the function iterates through the Pearson and Spearman correlation matrices, extracting and logging only relationships with correlation coefficients greater than 0.3 in absolute value;
- it then proceeds with temporal trend analysis by calling analyze_temporal_trends(). Again, the results and visualizations are stored. A notable element is the detailed calculation of changes between the first and last available years in the dataset: for average rating, the percentage change is calculated, while for other metrics such as average size, average installations, and number of apps, absolute values at the beginning and end of the analyzed period are recorded;
- all results and figures are collected in the results dictionary, which is returned as the function output. The try-except block wrapping the entire implementation provides robustness to the analysis. When an exception occurs, the code catches the error by entering the except block, records it in the logging system through logger.error(), and re-raises it with the raise instruction without parameters.
This approach:
ensures that no error goes unnoticed thanks to logging;
maintains the complete trace of the original error (type, message, and stack trace) and allows the calling code to implement any recovery strategies.
Unlike handling that would "swallow" the error, this technique ensures that problems in the data or analysis process are correctly identified and can be addressed appropriately.
Interpretation of Results¶
Correlation Comparison (Pearson vs Spearman)¶
Analyzing the matrices, several interesting correlations emerge:
Days since last update vs Log size
Pearson: -0.35 / Spearman: -0.33
This negative correlation indicates that more recently updated apps tend to have larger sizes. It could reflect a tendency of developers to release updates that enrich the app with new features, consequently increasing its size.
Log size vs Log installations
Pearson: 0.34 / Spearman: 0.35
This positive correlation suggests that larger apps tend to have more installations. This could indicate that users are willing to download heavier apps when they offer more features or content, or that more successful apps tend to expand over time.
Days since last update vs Log installations
Pearson: -0.19 / Spearman: -0.33
It's interesting to note how this correlation is stronger according to Spearman than Pearson, suggesting a monotonic, but not perfectly linear relationship. Apps updated more frequently tend to have more installations, presumably because regular updates keep the app relevant and attractive to users.
Price vs other metrics
Both matrices show weak correlations between price and other metrics, with values rarely exceeding ±0.20. This suggests that price has a limited relationship with other parameters, which could indicate that pricing strategies are determined by factors other than popularity or technical characteristics of apps.
Rating vs other metrics
Rating also shows generally weak correlations with other metrics, with the exception of a slight negative correlation with days since last update (-0.13 in Pearson, -0.19 in Spearman), suggesting that more recently updated apps tend to have slightly better ratings.
In summary, we can deduce that:
update frequency seems to be an important factor correlated with app success, suggesting the importance of regular maintenance;
app size has a positive correlation with installations, indicating that users might prefer apps richer in features;
price and rating seem to be influenced by more complex factors not directly captured by the other metrics analyzed.
It is important to note, however, that while the correlations are not extremely strong, they still provide useful indications about relationships between app characteristics.
Scatter plots of key relationships¶
These scatter plots greatly enrich the understanding of the correlations analyzed previously:
they visually confirm the nature of relationships, showing not only their strength, but also their shape;
they highlight non-linear patterns, particularly visible in the Rating vs Days since last update graph;
they allow identification of clusters and distributions that simple correlations do not capture.
The strategic implications that emerge from this visual analysis reinforce what has already been observed:
regular updates seem to be fundamental to maintaining high ratings and, potentially, to increasing installations
the positive relationship between app size and their diffusion suggests that users appreciate apps richer in features;
there is an evident interconnection between high rating and greater number of installations, which may indicate how perceived quality can influence the popularity of an app.
Temporal evolution of key metrics¶
The chart offers a dynamic perspective on the app market from 2010 to 2018, divided into two panels: the upper one shows product metrics (rating, price, size) and the lower one market metrics (installations and number of apps).
In the upper panel we observe some trends in app characteristics:
Average size (blue line): shows the most significant growth, going from values close to zero in 2010 to about 25 MB in 2018. This constant and substantial increase reflects the evolution of mobile device hardware capabilities and the growing complexity of modern apps, which incorporate increasingly advanced features, higher quality graphical elements, and multimedia content.
Average price (red line): presents a more irregular trend with a notable surge between 2016 and 2017 (from about $5 to $22), followed by a slight decrease in 2018. This peak could indicate a change in monetization strategies or the entry into the market of premium apps in specific categories. The subsequent decrease suggests a possible market adjustment towards more competitive prices.
Average rating (green line): remains remarkably stable around the value 4 during the entire period, with minimal variations. This stability is interesting considering the significant changes in other metrics and suggests that, despite market evolution, user expectations regarding quality and developers' ability to meet them have remained relatively constant.
In the lower panel, displayed on a logarithmic scale:
Average installations (orange line): show a gradual increase in the analyzed period, with a more marked acceleration in recent years, reaching over a million average installations per app in 2018. This trend suggests increasing smartphone penetration and greater user engagement with apps.
Number of apps (gray bars): highlights exponential growth, reaching several thousand apps in 2018. This growth testifies to the explosion of the app ecosystem and intensifying competition. It is important to note that the logarithmic scale of the graph visually attenuates this growth which is actually much more pronounced than it appears.
Analyzing the two panels jointly, interesting relationships emerge:
increasing size and complexity: the constant increase in the average size of apps has occurred in parallel with the growth of average installations, suggesting that users have not been discouraged by heavier apps, probably because they offer richer and more satisfying experiences and greater functionality;
stability of perceived quality: despite the increase in the complexity and size of apps, the average rating has remained stable;
price dynamics: the significant increase in prices between 2016 and 2017, followed by a slight decrease, could reflect more aggressive monetization attempts in a mature market, followed by competitive adjustments;
market saturation: the exponential growth in the number of apps, combined with the more moderate increase in average installations, suggests increasing competition for user attention.
The analyses carried out in this code fragment, together with the related visualizations, offer valuable guidance for the launch of a hypothetical new app:
users seem to accept larger apps, provided they offer proportional value;
the market has become extremely competitive, with thousands of apps competing for user attention;
the stability of ratings suggests that quality expectations are well established;
pricing strategies require particular attention, considering the significant changes observed in recent years.
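As a quick numeric illustration of the point about the logarithmic scale, the sketch below (with made-up yearly counts, not values from the dataset) shows how a log axis compresses exponential growth:

```python
import numpy as np

# Hypothetical app counts doubling each year (illustrative values only)
apps_per_year = np.array([100, 200, 400, 800, 1600, 3200])

# On a linear axis, the last bar is 32 times taller than the first
linear_ratio = apps_per_year[-1] / apps_per_year[0]

# On a log10 axis, the same bars differ in height by less than a factor of 2
log_heights = np.log10(apps_per_year)
log_ratio = log_heights[-1] / log_heights[0]

print(linear_ratio)          # 32.0
print(round(log_ratio, 2))   # 1.75
```

This is why the lower panel's bar chart understates how explosive the growth in the number of apps really is.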
logger = logging.getLogger(__name__)
class CorrelationResults(NamedTuple):
pearson: pd.DataFrame
spearman: pd.DataFrame
figures: List[go.Figure]
class TimeMetrics(NamedTuple):
data: pd.DataFrame
figure: go.Figure
@dataclass
class PlayStoreAnalyzer:
apps_df: pd.DataFrame
reviews_df: Optional[pd.DataFrame] = None
max_workers: int = 4
def __post_init__(self):
# Drop the known malformed row whose Category column contains '1.9'
self.apps_df = self.apps_df[self.apps_df['Category'] != '1.9'].copy()
self.prepare_data()
def prepare_data(self) -> None:
# Temporal metrics
self.apps_df['Last Updated'] = pd.to_datetime(self.apps_df['Last Updated'])
# Use the most recent date in the dataset as reference
max_date = self.apps_df['Last Updated'].max()
self.apps_df['Days_Since_Update'] = (
max_date - self.apps_df['Last Updated']
).dt.days
self.apps_df['Update_Year'] = self.apps_df['Last Updated'].dt.year
# Logarithmic transformations
self.apps_df['Log_Installs'] = np.log1p(self.apps_df['Installs_Clean'])
self.apps_df['Log_Size'] = np.log1p(self.apps_df['Size_MB'])
# Market metrics
total_installs = self.apps_df['Installs_Clean'].sum()
self.apps_df['market_share'] = self.apps_df['Installs_Clean'] / total_installs
self.apps_df['category_share'] = self.apps_df.groupby('Category')['Installs_Clean'].transform(
lambda x: x / x.sum()
)
# Merge with sentiment if available
if self.reviews_df is not None:
self._merge_sentiment_data()
def _merge_sentiment_data(self) -> None:
sentiment_data = self.reviews_df.merge(
self.apps_df[['App', 'Category']],
on='App',
how='inner'
)
app_sentiment = sentiment_data.groupby(['App', 'Category']).agg({
'Sentiment_Polarity': 'mean',
'Sentiment_Subjectivity': 'mean'
}).reset_index()
self.apps_df = self.apps_df.merge(
app_sentiment,
on=['App', 'Category'],
how='left'
)
@staticmethod
def _create_correlation_heatmap(corr_matrix: pd.DataFrame,
title: str) -> go.Figure:
return go.Figure(
data=go.Heatmap(
z=corr_matrix.values,
x=corr_matrix.columns,
y=corr_matrix.index,
colorscale='RdBu',
zmin=-1,
zmax=1,
text=corr_matrix.values.round(2),
texttemplate='%{text}',
textfont={"size": 10}
),
layout=dict(
title=title,
height=600,
font=dict(family="Arial", size=12)
)
)
def _create_scatter_matrix(self, metrics: Dict[str, str]) -> go.Figure:
scatter_pairs = [
('Rating', 'Log_Installs'),
('Rating', 'Log_Size'),
('Log_Size', 'Log_Installs'),
('Rating', 'Days_Since_Update')
]
fig = make_subplots(
rows=2, cols=2,
subplot_titles=[
f'{metrics[x]} vs {metrics[y]}'
for x, y in scatter_pairs
]
)
# Color scale for days since last update (stops are fractions of cmax=1000)
colorscale = [
[0, 'green'], # just updated
[0.1, 'lightgreen'], # ~100 days
[0.2, 'yellow'], # ~200 days
[0.4, 'orange'], # ~400 days
[0.6, 'red'], # ~600 days
[1.0, 'darkred'] # 1000+ days
]
for idx, (x, y) in enumerate(scatter_pairs):
row = idx // 2 + 1
col = idx % 2 + 1
hover_text = [
f"App: {app}<br>" +
f"Category: {cat}<br>" +
f"{metrics[x]}: {val_x:.2f}<br>" +
f"{metrics[y]}: {val_y:.2f}<br>" +
f"Price: ${price:.2f}<br>" +
f"Installations: {inst:,.0f}<br>" +
f"Size: {size:.1f}MB<br>" +
f"Days since last update: {days:.0f}"
for app, cat, val_x, val_y, price, inst, size, days in zip(
self.apps_df['App'],
self.apps_df['Category'],
self.apps_df[x],
self.apps_df[y],
self.apps_df['Price_Clean'],
self.apps_df['Installs_Clean'],
self.apps_df['Size_MB'],
self.apps_df['Days_Since_Update']
)
]
fig.add_trace(
go.Scatter(
x=self.apps_df[x],
y=self.apps_df[y],
mode='markers',
marker=dict(
size=4,
opacity=0.6,
color=self.apps_df['Days_Since_Update'],
colorscale=colorscale,
colorbar=dict(
title='Days since<br>last update',
ticktext=['0', '30', '90', '180', '365', '730', '>730'],
tickvals=[0, 30, 90, 180, 365, 730, 1000]
) if idx == 1 else None,
cmin=0,
cmax=1000
),
name=f'{metrics[x]} vs {metrics[y]}',
hovertemplate="%{text}<extra></extra>",
text=hover_text,
showlegend=False
),
row=row, col=col
)
# Updating axes
fig.update_xaxes(title=metrics[x], row=row, col=col, gridcolor='lightgray', showgrid=True)
fig.update_yaxes(title=metrics[y], row=row, col=col, gridcolor='lightgray', showgrid=True)
fig.update_layout(
title='Scatter plots of main relationships',
height=800,
width=1000,
showlegend=False,
title_x=0.5,
hovermode='closest',
plot_bgcolor='white',
margin=dict(t=100, l=50, r=50, b=50)
)
return fig
def analyze_correlations(self) -> Tuple[pd.DataFrame, pd.DataFrame, List[go.Figure]]:
metrics = {
'Rating': 'Rating',
'Price_Clean': 'Price',
'Log_Installs': 'Log installations',
'Log_Size': 'Log size',
'Days_Since_Update': 'Days since last update'
}
# Calculate correlations
data = self.apps_df[list(metrics.keys())]
pearson_corr = data.corr(method='pearson').round(3)
spearman_corr = data.corr(method='spearman').round(3)
# Rename columns
for corr_matrix in [pearson_corr, spearman_corr]:
corr_matrix.columns = list(metrics.values())
corr_matrix.index = list(metrics.values())
# Create graphs
figures = []
# Correlation heatmap
heatmap_fig = make_subplots(
rows=1, cols=2,
subplot_titles=('Pearson Correlations', 'Spearman Correlations'),
horizontal_spacing=0.15
)
# Add Pearson heatmap
heatmap_fig.add_trace(
go.Heatmap(
z=pearson_corr.values,
x=pearson_corr.columns,
y=pearson_corr.index,
colorscale='RdBu',
zmin=-1,
zmax=1,
text=pearson_corr.values.round(2),
texttemplate='%{text}',
textfont={"size": 10}
),
row=1, col=1
)
# Add Spearman heatmap
heatmap_fig.add_trace(
go.Heatmap(
z=spearman_corr.values,
x=spearman_corr.columns,
y=spearman_corr.index,
colorscale='RdBu',
zmin=-1,
zmax=1,
text=spearman_corr.values.round(2),
texttemplate='%{text}',
textfont={"size": 10}
),
row=1, col=2
)
heatmap_fig.update_layout(
title='Pearson vs Spearman correlations comparison',
title_x=0.5,
height=600,
width=1500,
font=dict(family="Arial", size=12),
margin=dict(t=100, l=100, r=100, b=50)
)
figures.append(heatmap_fig)
# Scatter matrix
scatter_pairs = [
('Rating', 'Log_Installs'),
('Rating', 'Log_Size'),
('Log_Size', 'Log_Installs'),
('Rating', 'Days_Since_Update')
]
scatter_fig = make_subplots(
rows=2, cols=2,
subplot_titles=[
f'{metrics[x]} vs {metrics[y]}'
for x, y in scatter_pairs
],
horizontal_spacing=0.15,
vertical_spacing=0.15
)
# Creating a color scale based on rating
colorscale = [
[0, 'red'], # Rating 1
[0.25, 'orange'], # Rating 2
[0.5, 'yellow'], # Rating 3
[0.75, 'lightgreen'], # Rating 4
[1, 'green'] # Rating 5
]
for idx, (x, y) in enumerate(scatter_pairs):
row = idx // 2 + 1
col = idx % 2 + 1
hover_text = [
f"App: {app}<br>" +
f"Category: {cat}<br>" +
f"{metrics[x]}: {val_x:.2f}<br>" +
f"{metrics[y]}: {val_y:.2f}<br>" +
f"Rating: {rating:.1f}<br>" +
f"Price: ${price:.2f}<br>" +
f"Installations: {inst:,.0f}<br>" +
f"Size: {size:.1f}MB<br>" +
f"Days since last update: {days:.0f}"
for app, cat, val_x, val_y, rating, price, inst, size, days in zip(
self.apps_df['App'],
self.apps_df['Category'],
self.apps_df[x],
self.apps_df[y],
self.apps_df['Rating'],
self.apps_df['Price_Clean'],
self.apps_df['Installs_Clean'],
self.apps_df['Size_MB'],
self.apps_df['Days_Since_Update']
)
]
if y == 'Days_Since_Update':
# Add jitter to rating to avoid overlapping
jittered_x = self.apps_df[x] + np.random.normal(0, 0.05, len(self.apps_df))
# Calculate moving average
rating_range = np.arange(1, 5.1, 0.1)
days_mean = []
for r in rating_range:
mask = (self.apps_df[x] >= r - 0.2) & (self.apps_df[x] < r + 0.2)
mean_val = self.apps_df.loc[mask, y].mean()
days_mean.append(mean_val)
scatter_fig.add_trace(
go.Scatter(
x=jittered_x,
y=self.apps_df[y],
mode='markers',
marker=dict(
size=3,
opacity=0.3,
color=self.apps_df['Rating'],
colorscale=colorscale,
colorbar=dict(
title='Rating',
ticktext=['1', '2', '3', '4', '5'],
tickvals=[1, 2, 3, 4, 5]
) if idx == 1 else None,
cmin=1,
cmax=5
),
name=f'{metrics[x]} vs {metrics[y]}',
hovertemplate="%{text}<extra></extra>",
text=hover_text,
showlegend=False
),
row=row, col=col
)
# Add moving average line
scatter_fig.add_trace(
go.Scatter(
x=rating_range,
y=days_mean,
mode='lines',
line=dict(color='black', width=2),
name='Moving average',
showlegend=False
),
row=row, col=col
)
# Update layout for this specific subplot
scatter_fig.update_xaxes(
title=metrics[x],
row=row,
col=col,
gridcolor='lightgray',
showgrid=True,
range=[0.5, 5.5]
)
# Use 95th percentile for y-axis
y_max = self.apps_df[y].quantile(0.95)
scatter_fig.update_yaxes(
title=metrics[y],
row=row,
col=col,
gridcolor='lightgray',
showgrid=True,
range=[0, y_max]
)
else:
scatter_fig.add_trace(
go.Scatter(
x=self.apps_df[x],
y=self.apps_df[y],
mode='markers',
marker=dict(
size=4,
opacity=0.6,
color=self.apps_df['Rating'],
colorscale=colorscale,
colorbar=dict(
title='Rating',
ticktext=['1', '2', '3', '4', '5'],
tickvals=[1, 2, 3, 4, 5]
) if idx == 1 else None,
cmin=1,
cmax=5
),
name=f'{metrics[x]} vs {metrics[y]}',
hovertemplate="%{text}<extra></extra>",
text=hover_text,
showlegend=False
),
row=row, col=col
)
scatter_fig.update_xaxes(
title=metrics[x],
row=row,
col=col,
gridcolor='lightgray',
showgrid=True
)
scatter_fig.update_yaxes(
title=metrics[y],
row=row,
col=col,
gridcolor='lightgray',
showgrid=True
)
scatter_fig.update_layout(
title='Scatter plots of main relationships',
height=800,
width=1500,
showlegend=False,
title_x=0.5,
hovermode='closest',
plot_bgcolor='white',
margin=dict(t=100, l=50, r=50, b=50)
)
figures.append(scatter_fig)
return CorrelationResults(pearson=pearson_corr, spearman=spearman_corr, figures=figures)
def analyze_temporal_trends(self) -> TimeMetrics:
# Average price computed over paid apps only (free apps would drag the mean toward zero)
avg_price = self.apps_df.groupby('Update_Year').agg({
'Price_Clean': lambda x: x[x > 0].mean() if len(x[x > 0]) > 0 else np.nan
})
# Efficient temporal metrics aggregation
time_metrics = self.apps_df.groupby('Update_Year').agg({
'Rating': 'mean',
'Size_MB': 'mean',
'Installs_Clean': 'mean',
'App': 'count'
}).round(2)
time_metrics['Price_Clean'] = avg_price['Price_Clean']
# Optimized figure creation with subplot
fig = make_subplots(
rows=2,
cols=1,
row_heights=[0.6, 0.4],
vertical_spacing=0.12,
subplot_titles=(
'Product Metrics (Rating, Price, Size)',
'Market Metrics (Installations and number of apps)'
)
)
# Subplot 1 trace configuration: (column, label, color, format type).
# Note: _format_trend_value expects 'rating'/'price'/'size', so the format
# type is carried explicitly rather than derived from the column name
# (col.lower() would yield 'price_clean' and 'size_mb', which fall through
# to the unformatted default).
traces_subplot1 = [
('Rating', 'Average Rating', '#2ECC40', 'rating'),
('Price_Clean', 'Average Price ($)', '#FF4136', 'price'),
('Size_MB', 'Average Size (MB)', '#0074D9', 'size')
]
# Add subplot 1 traces
for col, name, color, fmt_type in traces_subplot1:
data = time_metrics[col].fillna(0)
hover_text = [
f"Year: {year}<br>{name}: {self._format_trend_value(val, fmt_type)}"
for year, val in zip(time_metrics.index, time_metrics[col])
]
fig.add_trace(
go.Scatter(
x=time_metrics.index,
y=data,
name=name,
line=dict(color=color, width=2),
hovertemplate="%{text}<extra></extra>",
text=hover_text
),
row=1,
col=1
)
# Add number of apps bars to subplot 2
hover_text_app = [
f"Year: {year}<br>Number of apps: {self._format_trend_value(val, 'app')}"
for year, val in zip(time_metrics.index, time_metrics['App'])
]
fig.add_trace(
go.Bar(
x=time_metrics.index,
y=time_metrics['App'],
name='Number of apps',
marker_color='#AAAAAA',
opacity=0.3,
width=0.5,
hovertemplate="%{text}<extra></extra>",
text=hover_text_app
),
row=2,
col=1
)
# Add installations line to subplot 2
hover_text_inst = [
f"Year: {year}<br>Average Installations: {self._format_trend_value(val, 'installations')}"
for year, val in zip(time_metrics.index, time_metrics['Installs_Clean'])
]
fig.add_trace(
go.Scatter(
x=time_metrics.index,
y=time_metrics['Installs_Clean'],
name='Average Installations',
line=dict(color='#FF851B', width=2),
hovertemplate="%{text}<extra></extra>",
text=hover_text_inst
),
row=2,
col=1
)
# Layout optimization
fig.update_layout(
title={
'text': 'Temporal evolution of key metrics',
'y': 0.98,
'x': 0.5,
'xanchor': 'center',
'yanchor': 'top'
},
height=900,
showlegend=True,
legend=dict(
orientation='h',
yanchor='bottom',
y=1.05,
xanchor='center',
x=0.5,
bgcolor='rgba(255, 255, 255, 0.8)',
bordercolor='lightgray',
borderwidth=1
),
plot_bgcolor='white',
hovermode='x unified',
margin=dict(t=120, b=50, l=50, r=50)
)
# Axes optimization
for row in [1, 2]:
fig.update_xaxes(
title='Year',
showgrid=True,
gridwidth=1,
gridcolor='lightgray',
row=row
)
fig.update_yaxes(
title='Value',
showgrid=True,
gridwidth=1,
gridcolor='lightgray',
row=1,
col=1
)
fig.update_yaxes(
title='Number',
showgrid=True,
gridwidth=1,
gridcolor='lightgray',
type='log',
row=2,
col=1
)
return TimeMetrics(time_metrics, fig)
def _format_trend_value(self, value: float, metric_type: str) -> str:
if pd.isna(value):
return "N/A"
if metric_type == "installations":
if value >= 1e6:
return f"{value/1e6:.1f}M"
elif value >= 1e3:
return f"{value/1e3:.1f}K"
return f"{value:.0f}"
elif metric_type == "size":
return f"{value:.1f}MB"
elif metric_type == "price":
return f"${value:.2f}"
elif metric_type == "rating":
return f"{value:.2f}"
elif metric_type == "app":
return f"{int(value):,}"
return str(value)
# Module-level entry point that orchestrates the full analysis
def analyze_play_store(apps_df: pd.DataFrame, reviews_df: Optional[pd.DataFrame] = None) -> Dict[str, Any]:
logger.info("=== GOOGLE PLAY STORE ANALYSIS ===")
# Initializing analyzer
analyzer = PlayStoreAnalyzer(apps_df, reviews_df)
figures = []
results = {}
try:
# 1. Correlation analysis
logger.info("\n1. Analysis of correlations between metrics")
pearson_corr, spearman_corr, corr_figures = analyzer.analyze_correlations()
results['correlations'] = {'pearson': pearson_corr, 'spearman': spearman_corr}
figures.extend(corr_figures)
# Log correlations
for method, corr_matrix in [('Pearson', pearson_corr), ('Spearman', spearman_corr)]:
logger.info(f"\nSignificant {method} correlations (|corr| > 0.3):")
for i in range(len(corr_matrix.columns)):
for j in range(i+1, len(corr_matrix.columns)):
corr = corr_matrix.iloc[i, j]
if abs(corr) > 0.3:
logger.info(
f"{corr_matrix.index[i]} vs {corr_matrix.columns[j]}: {corr:.3f}"
)
# 2. Temporal analysis
logger.info("\n2. Analysis of temporal trends")
time_metrics, time_fig = analyzer.analyze_temporal_trends()
results['temporal'] = time_metrics
figures.append(time_fig)
# Log main trends
first_metrics = time_metrics.iloc[0]
last_metrics = time_metrics.iloc[-1]
rating_change = ((last_metrics['Rating'] - first_metrics['Rating']) /
first_metrics['Rating'] * 100)
logger.info("\nMain trends:")
logger.info(
f"Average rating: {rating_change:+.1f}% change "
f"(from {first_metrics['Rating']:.2f} to {last_metrics['Rating']:.2f})"
)
logger.info(
f"Average size: from {first_metrics['Size_MB']:.1f}MB to "
f"{last_metrics['Size_MB']:.1f}MB"
)
def format_installs(val):
return f"{val/1e6:.1f}M" if val >= 1e6 else (f"{val/1e3:.1f}K" if val >= 1e3 else f"{val:.0f}")
logger.info(
f"Average installations: from {format_installs(first_metrics['Installs_Clean'])} to "
f"{format_installs(last_metrics['Installs_Clean'])}"
)
logger.info(
f"Number of apps: from {int(first_metrics['App']):,} to "
f"{int(last_metrics['App']):,}"
)
if pd.notna(first_metrics['Price_Clean']) and pd.notna(last_metrics['Price_Clean']):
logger.info(
f"Average price: from ${first_metrics['Price_Clean']:.2f} to "
f"${last_metrics['Price_Clean']:.2f}"
)
else:
logger.info("Average price: data not available")
results['figures'] = figures
return results
except Exception as e:
logger.error(f"Error during Play Store analysis: {str(e)}")
raise
# Run the analysis on the cleaned datasets
analysis_results = analyze_play_store(apps_clean, reviews_clean)
# Show figures
for fig in analysis_results['figures']:
fig.show()
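A brief note on why `analyze_correlations` computes both Pearson and Spearman coefficients: Pearson measures linear association, while Spearman works on ranks and therefore captures any monotonic relationship. A minimal sketch with hypothetical data (not drawn from the Play Store dataset) makes the difference concrete:

```python
import pandas as pd

# Hypothetical case: installs grow exponentially with rating,
# a monotonic but strongly nonlinear relationship
df = pd.DataFrame({'rating': [1, 2, 3, 4, 5]})
df['installs'] = 10.0 ** df['rating']

pearson = df['rating'].corr(df['installs'], method='pearson')
spearman = df['rating'].corr(df['installs'], method='spearman')

print(round(pearson, 2))   # 0.76 -> the linear fit is imperfect
print(spearman)            # 1.0  -> the ranks agree perfectly
```

On heavily skewed metrics such as installations, this is also why the analysis correlates `Log_Installs` rather than the raw counts: the `log1p` transformation brings the Pearson figures closer to the monotonic structure that Spearman already sees.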